Natural Language Processing and Language Technologies for the Basque Language
The presence of a language in the digital domain is crucial for its survival, as online communication and digital language resources have become the standard in recent decades and will gain further importance in the coming years. In order to develop the advanced systems that are considered essential for efficient digital communication (e.g. machine translation systems, text-to-speech and speech-to-text converters, and digital assistants), it is necessary to digitalise linguistic resources and create tools. In the case of Basque, scholars have studied the creation of digital linguistic resources and the tools that enable the development of such systems for the last forty years. In this paper, we present an overview of the natural language processing and language technology resources developed for Basque, their impact on the process of making Basque a “digital language”, and the applications and challenges in multilingual communication. More precisely, we present the well-known products for Basque, as well as the basic tools and resources behind the products we use every day. We hope that this survey will serve as a guide for other minority languages that are making their way towards digitalisation.
Received: 05 April 2022
Accepted: 20 May 2022
Patterns of Text Readability in Human and Predicted Eye Movements
It has been shown that multilingual transformer models are able to predict human reading behavior when fine-tuned on small amounts of eye-tracking data. As the aggregated prediction results do not provide insights into the linguistic cues that the model acquires to predict reading behavior, we conduct a deeper analysis of the predictions from the perspective of readability. We try to disentangle the three-fold relationship between human eye movements, the capability of language models to predict these eye-movement patterns, and sentence-level readability measures for English. We compare a range of model configurations to multiple baselines. We show that the models exhibit difficulties with function words and that pre-training only provides limited advantages for linguistic generalization.
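Sentence-level readability measures of the kind mentioned above can be illustrated with the classic Flesch Reading Ease formula; this is a generic example, not necessarily one of the measures used in the paper, and the syllable counter is a crude vowel-group heuristic:

```python
import re

def count_syllables(word):
    # Crude heuristic: count groups of consecutive vowel letters.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    # Flesch Reading Ease: 206.835 - 1.015*(words/sentence) - 84.6*(syllables/word).
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))

# A short, simple sentence scores higher (easier) than a morphologically dense one.
easy = flesch_reading_ease("The cat sat on the mat.")
hard = flesch_reading_ease("Incomprehensibility characterizes bureaucratic terminology.")
```

Higher scores indicate easier text; the formula penalizes long sentences and polysyllabic words.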
This is not a Dataset: A Large Negation Benchmark to Challenge Large Language Models
Although large language models (LLMs) have apparently acquired a certain level of grammatical knowledge and the ability to make generalizations, they fail to interpret negation, a crucial step in Natural Language Processing. We try to clarify the reasons for the sub-optimal performance of LLMs in understanding negation. We introduce a large, semi-automatically generated dataset of circa 400,000 descriptive sentences about commonsense knowledge that can be true or false, in which negation is present in about two thirds of the corpus in different forms. We have used our dataset with the largest available open LLMs in a zero-shot approach to gauge their generalization and inference capability, and we have also fine-tuned some of the models to assess whether the understanding of negation can be trained. Our findings show that, while LLMs are proficient at classifying affirmative sentences, they struggle with negative sentences and lack a deep understanding of negation, often relying on superficial cues. Although fine-tuning the models on negative sentences improves their performance, the lack of generalization in handling negation persists, highlighting the ongoing challenges of LLMs regarding negation understanding and generalization. The dataset and code are publicly available.
Comment: Accepted at the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023)
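The semi-automatic pairing of affirmative and negated statements can be sketched roughly as follows; the triple format and templates are illustrative assumptions, not the paper's actual generation pipeline:

```python
def build_items(triples):
    # Each triple: (subject, affirmative predicate, negated predicate,
    # truth value of the affirmative statement).
    # Negating the predicate flips the gold label, yielding paired items.
    items = []
    for subj, pred, neg_pred, is_true in triples:
        items.append((f"{subj} {pred}.", is_true))          # affirmative form
        items.append((f"{subj} {neg_pred}.", not is_true))  # negated form, label flipped
    return items

items = build_items([
    ("A lion", "is a mammal", "is not a mammal", True),
    ("A lion", "is a reptile", "is not a reptile", False),
])
# Each commonsense statement is paired with its negation and flipped label.
```

Probing a model on such pairs in a zero-shot true/false setting makes it directly measurable whether accuracy drops on the negated half of the data.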
Building the Gold Standard for the surface syntax of Basque
In this paper, we present the process of constructing SF-EPEC, a syntactically annotated 300,000-word corpus that aims to be a Gold Standard for the surface syntactic processing of Basque. First, the tagset designed for this purpose is described; since Basque is an agglutinative language, complex syntactic tags were sometimes needed. We also account for the different phases in the construction of SF-EPEC.
Noisy Channel for Automatic Text Simplification
In this paper we present a simple re-ranking method for Automatic Sentence Simplification based on the noisy channel scheme. Instead of directly computing the best simplification given a complex text, the re-ranking method also considers the probability of the simple sentence producing the complex counterpart, as well as the probability of the simple text itself according to a language model. Our experiments show that combining these scores outperforms the original system on three different English datasets, yielding the best known result on one of them. Adopting the noisy channel scheme opens new ways to infuse additional information into ATS systems, and thus to control important aspects of them, a known limitation of end-to-end neural seq2seq generative models.
Comment: 8 pages
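The re-ranking described above combines a channel score, log P(complex | simple), with a language-model prior, log P(simple). A minimal sketch, with toy stand-ins for both models (the real system would use trained seq2seq and language models):

```python
import math

def noisy_channel_rerank(complex_sent, candidates, channel_logprob, lm_logprob):
    # Score each candidate simplification s by
    # log P(complex | s) + log P(s), following the noisy channel scheme.
    return max(candidates,
               key=lambda s: channel_logprob(complex_sent, s) + lm_logprob(s))

# Toy stand-ins for the two models (illustration only):
def toy_channel_logprob(complex_sent, simple_sent):
    # Reward lexical overlap as a crude proxy for P(complex | simple).
    c, s = set(complex_sent.split()), set(simple_sent.split())
    return math.log(len(c & s) / max(len(c), 1) + 1e-9)

def toy_lm_logprob(simple_sent):
    # Shorter sentences get higher "fluency" under this toy prior.
    return -0.5 * len(simple_sent.split())

best = noisy_channel_rerank(
    "the physician administered the medication",
    ["the doctor gave the drug",
     "the physician gave the medication"],
    toy_channel_logprob, toy_lm_logprob)
```

The channel term discourages simplifications that drift too far from the source, while the prior favors fluent, simple output; re-ranking picks the candidate that balances both.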
Data-driven sentence simplification: Survey and benchmark
Sentence Simplification (SS) aims to modify a sentence in order to make it easier to read and understand. To do so, several rewriting transformations can be performed, such as replacement, reordering, and splitting. Executing these transformations while keeping sentences grammatical, preserving their main idea, and generating simpler output is a challenging and still far from solved problem. In this article, we survey research on SS, focusing on approaches that attempt to learn how to simplify using corpora of aligned original-simplified sentence pairs in English, which is the dominant paradigm nowadays. We also include a benchmark of different approaches on common datasets so as to compare them and highlight their strengths and limitations. We expect that this survey will serve as a starting point for researchers interested in the task and help spark new ideas for future developments.
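As a toy illustration of the replacement transformation mentioned above (the substitution table here is invented for the example; data-driven systems learn such rewrites from aligned original-simplified corpora):

```python
# Hypothetical table of complex -> simpler word substitutions.
SIMPLER = {"utilize": "use", "commence": "begin", "terminate": "end"}

def lexical_simplify(sentence):
    # Replace each word with a simpler synonym when one is known,
    # leaving all other words untouched.
    return " ".join(SIMPLER.get(word, word) for word in sentence.split())

simplified = lexical_simplify("commence the procedure and utilize the tool")
```

Reordering and splitting operate at the syntactic level and are correspondingly harder to express as a lookup; the challenge noted above is applying all three while keeping the output grammatical and meaning-preserving.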
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.